ABSTRACT

This paper establishes the theoretical framework of b-bit
minwise hashing. The original minwise hashing method [3]
has become a standard technique for estimating set similarity
(e.g., resemblance) with applications in information
retrieval, data management, social networks and computational
advertising.

By only storing the lowest b bits of each (minwise) hashed
value (e.g., b = 1 or 2), one can gain substantial advantages
in terms of computational efficiency and storage space. We
prove the basic theoretical results and provide an unbiased
estimator of the resemblance for any b. We demonstrate
that, even in the least favorable scenario, using b = 1 may
reduce the storage space at least by a factor of 21.3 (or 10.7)
compared to using b = 64 (or b = 32), if one is interested in
resemblance > 0.5.

Categories and Subject Descriptors 

H. 2.8 [Database Applications]: Data Mining 

General Terms 

Algorithms, Performance, Theory 

Keywords

1. INTRODUCTION

Computing the size of set intersections is a fundamental
problem in information retrieval, databases, and machine
learning. Given two sets, $S_1$ and $S_2$, where

$$S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\},$$

a basic task is to compute the joint size $a = |S_1 \cap S_2|$, which
measures the (un-normalized) similarity between $S_1$ and $S_2$.



1 This version slightly modifies the first draft written in
August 2009.



The so-called resemblance, denoted by $R$, provides a normalized
similarity measure:

$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}, \qquad \text{where } f_1 = |S_1|,\ f_2 = |S_2|.$$

It is known that $1 - R$, the resemblance distance, is a metric,
i.e., satisfying the triangle inequality [3][7].

In large datasets encountered in information retrieval and
databases, efficiently computing the joint sizes is often highly
challenging [2][14]. Detecting (nearly) duplicate web pages
is a classical example [3][5].

Typically, each Web document can be processed as "a bag
of shingles," where a shingle consists of w contiguous words
in a document. Here w is a tuning parameter and was set
to be w = 5 in several studies [3][5][10].
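
The shingling step described above can be sketched in a few lines. This is only an illustration under simple assumptions: whitespace tokenization, lowercasing, and the helper names `shingles` and `resemblance` are choices made here, not details taken from the cited studies.

```python
def shingles(text, w=5):
    """Return the set of w-shingles (w contiguous words) of a document.

    Tokenization by whitespace and lowercasing are assumptions of this sketch.
    """
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(s1, s2):
    """Exact resemblance |S1 n S2| / |S1 u S2| of two non-empty sets."""
    union = s1 | s2
    # Returning 0.0 for two empty sets is a convention chosen for this sketch.
    return len(s1 & s2) / len(union) if union else 0.0


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the sleepy dog"
r_exact = resemblance(shingles(doc1), shingles(doc2))
```

Each 9-word document yields five 5-shingles; the two documents share three of them, so the exact resemblance here is 3/7.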

Clearly, the total number of possible shingles is huge. Considering
merely $10^5$ unique English words, the total number
of possible 5-shingles should be $D = (10^5)^5 = O(10^{25})$.
Prior studies used $D = 2^{64}$ [10] and $D = 2^{40}$ [3][5].

1.1 Minwise Hashing 

In their seminal work, Broder and his colleagues developed
minwise hashing and successfully applied the technique to
the task of duplicate document removal at the Web scale
[3][5]. Since then, there have been considerable theoretical
and methodological developments [l31[7l[l^l4l [l^[2^1[2T] . 

As a general technique for estimating set similarity, minwise
hashing has been applied to a wide range of applications,
for example, content matching for online advertising [24],
detection of large-scale redundancy in enterprise file systems
[11], syntactic similarity algorithms for enterprise information
management [22], compressing social networks [8],
advertising diversification [13], community extraction and
classification in the Web graph [9], graph sampling [23],
wireless sensor networks [18], Web spam [26][17], Web graph
compression [6], text reuse in the Web [1], and many more.

Here, we give a brief introduction to this algorithm. Suppose
a random permutation $\pi$ is performed on $\Omega$, i.e.,

$$\pi : \Omega \longrightarrow \Omega, \qquad \text{where } \Omega = \{0, 1, \ldots, D-1\}.$$

An elementary probability argument can show

$$\Pr\left(\min(\pi(S_1)) = \min(\pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R. \qquad (1)$$



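After $k$ minwise independent permutations, denoted by $\pi_1$,
$\pi_2$, ..., $\pi_k$, one can estimate $R$ without bias, as a binomial
probability, i.e.,

$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}, \qquad (2)$$

$$\mathrm{Var}\left(\hat{R}_M\right) = \frac{1}{k} R(1 - R). \qquad (3)$$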
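
As an illustration of the estimator in (2), the following minimal sketch simulates $k$ random permutations on a toy universe and counts how often the two minima coincide. The function name `minhash_estimate`, the explicit shuffling of the whole universe, and the toy sets are assumptions made for this example; a Web-scale implementation would not materialize permutations this way.

```python
import random


def minhash_estimate(s1, s2, universe_size, k, seed=0):
    """Estimate the resemblance of s1 and s2 with k random permutations."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(k):
        perm = list(range(universe_size))
        rng.shuffle(perm)                 # perm[x] plays the role of pi(x)
        z1 = min(perm[x] for x in s1)     # min(pi(S1))
        z2 = min(perm[x] for x in s2)     # min(pi(S2))
        matches += (z1 == z2)
    return matches / k


# Toy data (hypothetical): |S1 n S2| = 30 and |S1 u S2| = 90, so R = 1/3.
s1 = set(range(0, 60))
s2 = set(range(30, 90))
true_r = len(s1 & s2) / len(s1 | s2)
est = minhash_estimate(s1, s2, universe_size=200, k=2000)
```

With $k = 2000$, the standard deviation predicted by (3) is about 0.01, so the estimate should fall close to 1/3.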



Throughout the paper, we will frequently use the term "sample,"
corresponding to the term "sample size" (denoted by
$k$). In minwise hashing, a sample is a hashed value, e.g.,
$\min(\pi_j(S_1))$, which may require e.g., 64 bits to store [10],
depending on the universal size $D$. The total storage for
each set would be $bk$ bits, where $b = 64$ is possible.

After the samples have been collected, the storage and computational
cost is proportional to $b$. Therefore, reducing the
number of bits for each hashed value would be useful, not
only for saving significant storage space but also for considerably
improving the computational efficiency.

1.2 Our Main Contributions 

In this paper, we establish a unified theoretical framework
for b-bit minwise hashing. Instead of using b = 64 bits [10]
or 40 bits [3][5], our theoretical results suggest using as few
as b = 1 or b = 2 bits can yield significant improvements.



In b-bit minwise hashing, a "sample" consists of b bits only,
as opposed to e.g., 64 bits in the original minwise hashing.

Intuitively, using fewer bits per sample will increase the estimation
variance, compared to (3), at the same "sample size"
k. Thus, we will have to increase k to maintain the same
accuracy. Interestingly, our theoretical results will demonstrate
that, when resemblance is not too small (e.g., R > 0.5,
the threshold used in [3][5]), we do not have to increase k
much. This means that, compared to the earlier approach,
the b-bit minwise hashing can be used to improve estimation
accuracy and significantly reduce storage requirements
at the same time.

For example, when b = 1 and R = 0.5, the estimation variance
will increase at most by a factor of 3 (even in the least
favorable scenario). This means, in order not to lose accuracy,
we have to increase the sample size by a factor of 3.
If we originally stored each hashed value using 64 bits [10],
the improvement by using b = 1 will be 64/3 = 21.3.

2. THE FUNDAMENTAL RESULTS 

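Consider two sets, $S_1$ and $S_2$,

$$S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\},$$

$$f_1 = |S_1|, \quad f_2 = |S_2|, \quad a = |S_1 \cap S_2|.$$

Apply a random permutation $\pi$ on $S_1$ and $S_2$: $\pi : \Omega \longrightarrow \Omega$.
Define the minimum values under $\pi$ to be $z_1$ and $z_2$:

$$z_1 = \min(\pi(S_1)), \qquad z_2 = \min(\pi(S_2)).$$

Define $e_{1,i}$ = ith lowest bit of $z_1$, and $e_{2,i}$ = ith lowest bit
of $z_2$. Theorem 1 derives the analytical expression for $E_b$:

$$E_b = \Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1\right). \qquad (4)$$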



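Theorem 1. Assume $D$ is large.

$$\Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1\right) = C_{1,b} + (1 - C_{2,b}) R \qquad (5)$$

where

$$r_1 = \frac{f_1}{D}, \qquad r_2 = \frac{f_2}{D}, \qquad (6)$$

$$C_{1,b} = A_{1,b} \frac{r_2}{r_1 + r_2} + A_{2,b} \frac{r_1}{r_1 + r_2}, \qquad (7)$$

$$C_{2,b} = A_{1,b} \frac{r_1}{r_1 + r_2} + A_{2,b} \frac{r_2}{r_1 + r_2}, \qquad (8)$$

$$A_{1,b} = \frac{r_1 [1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}, \qquad (9)$$

$$A_{2,b} = \frac{r_2 [1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}. \qquad (10)$$

For a fixed $r_j$ (where $j \in \{1, 2\}$), $A_{j,b}$ is a monotonically
decreasing function of $b = 1, 2, 3, \ldots$.

For a fixed $b$, $A_{j,b}$ is a monotonically decreasing function of
$r_j \in [0, 1]$, with the limit to be

$$\lim_{r_j \to 0} A_{j,b} = \frac{1}{2^b}. \qquad (11)$$

Proof: See Appendix A. □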



Theorem 1 says that, for a given $b$, the desired probability
(4) is determined by $R$ and the ratios, $r_1 = f_1/D$ and $r_2 = f_2/D$.
The only assumption needed in the proof of Theorem 1 is
that $D$ should be large, which is always satisfied in practice.

$A_{j,b}$ ($j \in \{1, 2\}$) is a decreasing function of $r_j$ and $A_{j,b} \le 1/2^b$.
As $b$ increases, $A_{j,b}$ converges to zero very quickly. In fact,
when $b \ge 32$, one can essentially view $A_{j,b} = 0$.
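
The quantities in Theorem 1 are straightforward to evaluate numerically. The sketch below is one possible way to do so, assuming $0 < r_j < 1$; the function names are hypothetical, and the use of `log1p`/`expm1` is a numerical-stability choice made here (it avoids cancellation for tiny $r_j$ and lets $[1 - r_j]^{2^b}$ underflow quietly to zero for large $b$), not something prescribed by the theorem.

```python
import math


def a_coef(r, b):
    """A_{j,b} = r (1-r)^(2^b - 1) / (1 - (1-r)^(2^b)), for 0 < r < 1."""
    n = 2 ** b
    log_q = math.log1p(-r)                  # log(1 - r)
    num = r * math.exp((n - 1) * log_q)     # r (1-r)^(2^b - 1)
    den = -math.expm1(n * log_q)            # 1 - (1-r)^(2^b)
    return num / den


def c_coefs(r1, r2, b):
    """Return (C_{1,b}, C_{2,b}) as weighted averages of A_{1,b} and A_{2,b}."""
    a1, a2 = a_coef(r1, b), a_coef(r2, b)
    c1 = a1 * r2 / (r1 + r2) + a2 * r1 / (r1 + r2)
    c2 = a1 * r1 / (r1 + r2) + a2 * r2 / (r1 + r2)
    return c1, c2


def match_prob(R, r1, r2, b):
    """E_b = C_{1,b} + (1 - C_{2,b}) R: probability that the lowest b bits agree."""
    c1, c2 = c_coefs(r1, r2, b)
    return c1 + (1 - c2) * R
```

For example, `a_coef(1e-9, 1)` is essentially 1/2, in line with the limit (11), while `a_coef(0.1, 64)` evaluates to exactly 0.0 in double precision, in line with the remark that $A_{j,b}$ can be treated as zero for large $b$.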

2.1 The Unbiased Estimator 

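Theorem 1 naturally suggests an unbiased estimator of $R$,
denoted by $\hat{R}_b$:

$$\hat{R}_b = \frac{\hat{E}_b - C_{1,b}}{1 - C_{2,b}}, \qquad (12)$$

$$\hat{E}_b = \frac{1}{k} \sum_{j=1}^{k} \left\{ \prod_{i=1}^{b} 1\{e_{1,i,\pi_j} = e_{2,i,\pi_j}\} = 1 \right\}, \qquad (13)$$

where $e_{1,i,\pi_j}$ ($e_{2,i,\pi_j}$) denotes the ith lowest bit of $z_1$ ($z_2$)
under the permutation $\pi_j$.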
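
A small simulation can make (12) and (13) concrete. The sketch below is an illustration only: it draws explicit permutations of a toy universe, keeps the lowest $b$ bits of each minimum with a bit mask, and plugs the empirical match rate into (12). The function names, the toy sets, and the small $D$ are assumptions of this example, and since Theorem 1 assumes $D$ is large, a small finite-$D$ discrepancy is to be expected.

```python
import random


def a_coef(r, b):
    """A_{j,b} from Theorem 1 (direct formula; adequate for small b)."""
    n = 2 ** b
    return r * (1 - r) ** (n - 1) / (1 - (1 - r) ** n)


def bbit_estimate(s1, s2, D, k, b, seed=0):
    """b-bit minwise hashing estimate of the resemblance, following (12)-(13)."""
    rng = random.Random(seed)
    r1, r2 = len(s1) / D, len(s2) / D
    a1, a2 = a_coef(r1, b), a_coef(r2, b)
    c1 = (a1 * r2 + a2 * r1) / (r1 + r2)
    c2 = (a1 * r1 + a2 * r2) / (r1 + r2)
    mask = (1 << b) - 1                       # selects the lowest b bits
    hits = 0
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)
        z1 = min(perm[x] for x in s1)
        z2 = min(perm[x] for x in s2)
        hits += ((z1 & mask) == (z2 & mask))  # all b lowest bits agree
    e_hat = hits / k                          # empirical E_b, eq. (13)
    return (e_hat - c1) / (1 - c2)            # eq. (12)


# Toy data (hypothetical): true resemblance is 30 / 90 = 1/3.
s1 = set(range(0, 60))
s2 = set(range(30, 90))
est_1bit = bbit_estimate(s1, s2, D=500, k=2000, b=1)
```

Only `z & mask` would need to be stored per sample in practice; the full minimum is computed here just to extract its lowest bits.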



Following the property of the binomial distribution, we obtain

$$\mathrm{Var}\left(\hat{R}_b\right) = \frac{\mathrm{Var}\left(\hat{E}_b\right)}{[1 - C_{2,b}]^2} = \frac{1}{k} \frac{E_b (1 - E_b)}{[1 - C_{2,b}]^2}
= \frac{1}{k} \frac{[C_{1,b} + (1 - C_{2,b}) R] \, [1 - C_{1,b} - (1 - C_{2,b}) R]}{[1 - C_{2,b}]^2}. \qquad (14)$$
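
The variance formula (14) can be checked numerically against the claims made in the Introduction. The sketch below, with hypothetical function names and under the assumption $0 < r_1, r_2 < 1$, evaluates (14) and compares it with the original minwise hashing variance (3).

```python
import math


def a_coef(r, b):
    """A_{j,b} from Theorem 1, evaluated stably for 0 < r < 1."""
    n = 2 ** b
    log_q = math.log1p(-r)
    return r * math.exp((n - 1) * log_q) / (-math.expm1(n * log_q))


def var_bbit(R, r1, r2, b, k):
    """Var(R_b) from (14)."""
    a1, a2 = a_coef(r1, b), a_coef(r2, b)
    c1 = (a1 * r2 + a2 * r1) / (r1 + r2)
    c2 = (a1 * r1 + a2 * r2) / (r1 + r2)
    e_b = c1 + (1 - c2) * R
    return e_b * (1 - e_b) / (k * (1 - c2) ** 2)


def var_minwise(R, k):
    """Var(R_M) from (3)."""
    return R * (1 - R) / k


# Least favorable case (r1, r2 close to 0) at R = 0.5 with b = 1.
ratio = var_bbit(0.5, 1e-8, 1e-8, 1, 100) / var_minwise(0.5, 100)
```

Here `ratio` is essentially 3, matching the statement that at $R = 0.5$ the variance increases by at most a factor of 3 when $b = 1$. For $b = 64$ the two variances coincide in double precision.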



For large $b$ (i.e., $A_{1,b}, A_{2,b} \to 0$ and $C_{1,b}, C_{2,b} \to 0$), $\mathrm{Var}(\hat{R}_b)$
converges to the variance of $\hat{R}_M$, the estimator for the original
minwise hashing:

$$\lim_{b \to \infty} \mathrm{Var}\left(\hat{R}_b\right) = \frac{R(1 - R)}{k} = \mathrm{Var}\left(\hat{R}_M\right).$$

2.2 The Variance-Space Trade-off

As we decrease $b$, the space needed for storing each "sample"
will be smaller; the estimation variance (14) at the same
sample size $k$, however, will increase.

This variance-space trade-off can be precisely quantified by
the storage factor $B(b; R, r_1, r_2)$:

$$B(b; R, r_1, r_2) = b \times \mathrm{Var}\left(\hat{R}_b\right) \times k
= \frac{b \, [C_{1,b} + (1 - C_{2,b}) R] \, [1 - C_{1,b} - (1 - C_{2,b}) R]}{[1 - C_{2,b}]^2}. \qquad (15)$$

Lower $B(b)$ values are more desirable.

Figure 1 plots $B(b)$ for the whole range of $R \in (0, 1)$ and
four selected $r_1 = r_2$ values (from $10^{-10}$ to 0.9). Figure 1
shows that when the ratios, $r_1$ and $r_2$, are close to 1, it is
always desirable to use $b = 1$, almost for the whole range of
$R$. However, when $r_1$ and $r_2$ are close to 0, using $b = 1$ has
the advantage when about $R > 0.4$. For small $R$ and $r_1$, $r_2$,
it may be more advantageous to use larger $b$, e.g., $b \ge 2$.

Figure 1: $B(b; R, r_1, r_2)$ in (15). The lower the better.

The ratio of storage factors, $\frac{B(b_1; R, r_1, r_2)}{B(b_2; R, r_1, r_2)}$, directly measures
how much improvement using $b = b_2$ (e.g., $b_2 = 1$) can have
over using $b = b_1$ (e.g., $b_1 = 64$ or 32).

Some algebraic manipulation yields the following Theorem.

Theorem 2. If $r_1 = r_2$ and $b_1 > b_2$, then

$$\frac{B(b_1; R, r_1, r_2)}{B(b_2; R, r_1, r_2)} = \frac{b_1}{b_2} \, \frac{A_{1,b_1}(1 - R) + R}{A_{1,b_2}(1 - R) + R} \, \frac{1 - A_{1,b_2}}{1 - A_{1,b_1}} \qquad (16)$$

is a monotonically increasing function of $R \in [0, 1]$.

If $R \to 1$ (which implies $r_1 \to r_2$), then

$$\frac{B(b_1; R, r_1, r_2)}{B(b_2; R, r_1, r_2)} \to \frac{b_1}{b_2} \, \frac{1 - A_{1,b_2}}{1 - A_{1,b_1}}. \qquad (17)$$

If $r_1 = r_2$, $b_2 = 1$, $b_1 \ge 32$ (hence we treat $A_{1,b_1} = 0$), then

$$\frac{B(b_1; R, r_1, r_2)}{B(1; R, r_1, r_2)} = b_1 \frac{R}{R + 1 - r_1}. \qquad (18)$$

Proof: We omit the proof due to its simplicity. □



Suppose the original minwise hashing used $b = 64$ bits to
store each sample, then the maximum improvement of the
b-bit minwise hashing would be 64-fold, attained when $r_1 =
r_2 = 1$ and $R = 1$, according to (18). In the least favorable
situation, i.e., $r_1, r_2 \to 0$, the improvement will still be $\frac{64R}{R+1}$-fold,
which is $\frac{64}{3} = 21.3$-fold when $R = 0.5$.
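
The storage factor and the improvement ratios quoted above can be reproduced with a short calculation. The following sketch uses hypothetical function names and assumes $0 < r_1, r_2 < 1$; it evaluates the storage factor $B(b; R, r_1, r_2)$ directly from the variance formula rather than from the closed forms of Theorem 2, so the two can be compared.

```python
import math


def a_coef(r, b):
    """A_{j,b} from Theorem 1, evaluated stably for 0 < r < 1."""
    n = 2 ** b
    log_q = math.log1p(-r)
    return r * math.exp((n - 1) * log_q) / (-math.expm1(n * log_q))


def storage_factor(b, R, r1, r2):
    """B(b; R, r1, r2) = b * Var(R_b) * k."""
    a1, a2 = a_coef(r1, b), a_coef(r2, b)
    c1 = (a1 * r2 + a2 * r1) / (r1 + r2)
    c2 = (a1 * r1 + a2 * r2) / (r1 + r2)
    e_b = c1 + (1 - c2) * R
    return b * e_b * (1 - e_b) / (1 - c2) ** 2


def improvement(b1, b2, R, r1, r2):
    """Ratio B(b1) / B(b2): how much space using b2 bits saves over b1 bits."""
    return storage_factor(b1, R, r1, r2) / storage_factor(b2, R, r1, r2)


least_favorable_64 = improvement(64, 1, 0.5, 1e-8, 1e-8)   # about 21.3
least_favorable_32 = improvement(32, 1, 0.5, 1e-8, 1e-8)   # about 10.7
```

With $r_1 = r_2 = r$ and $b_1 = 64$, `improvement(64, 1, R, r, r)` agrees with the closed form $64R/(R + 1 - r)$.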

Figure 2 plots $\frac{B(32)}{B(b)}$, to directly visualize the relative improvement.
The plots are, of course, consistent with what
Theorem 2 would predict.

Figure 2: $\frac{B(32)}{B(b)}$, the relative storage improvement of
using b = 1, 2, 3, 4 bits, compared to using 32 bits.



3. EXPERIMENTS 

We conducted three experiments. The first two experiments
were based on a set of 2633 words, extracted from a chunk of
MSN Web pages. Our third experiment used a set of 10000
news articles crawled from the Web.

Our first experiment is a sanity check, to verify the correctness
of the theory. That is, our proposed estimator $\hat{R}_b$, (12),
is unbiased and its variance (the same as the mean square
error (MSE)) follows the prediction by our formula in (14).

3.1 Experiment 1 



For our first experiment, we selected 10 pairs of words to validate
the theoretical estimator $\hat{R}_b$ and the variance formula
$\mathrm{Var}(\hat{R}_b)$, derived in Sec. 2.1.

Table 1 summarizes the data and also provides the theoretical
improvements $\frac{B(32)}{B(1)}$ and $\frac{B(64)}{B(1)}$. For each word, the
data consist of the document IDs in which that word occurs.
The words were selected to include highly frequent
word pairs (e.g., "OF-AND"), highly rare word pairs (e.g.,
"GAMBIA-KIRIBATI"), highly unbalanced pairs (e.g., "A-TEST"),
highly similar pairs (e.g., "KONG-HONG"), as well
as word pairs that are not quite similar (e.g., "LOW-PAY").



Table 1: Ten pairs of words used in the experiments for
validating the estimator and theoretical variance (14).
Since the variance is determined by r1, r2, and R, words
were selected to ensure a good coverage of scenarios.

Word 1    Word 2       r1       r2       R       B(32)/B(1)   B(64)/B(1)
KONG      HONG         0.0145   0.0143   0.925   15.5         31.0
RIGHTS    RESERVED     0.187    0.172    0.877   16.6         32.2
OF        AND          0.570    0.554    0.771   20.4         40.8
GAMBIA    KIRIBATI     0.0031   0.0028   0.712   13.3         26.6
UNITED    STATES       0.062    0.061    0.591   12.4         24.8
SAN       FRANCISCO    0.049    0.025    0.476   10.7         21.4
CREDIT    CARD         0.046    0.041    0.285   7.3          14.6
TIME      JOB          0.189    0.05     0.128   4.3          8.6
LOW       PAY          0.045    0.043    0.112   3.4          6.8
A         TEST         0.596    0.035    0.052   3.1          6.2



We estimate the resemblance using the original minwise hashing
estimator $\hat{R}_M$ and the b-bit version for b = 1, 2, 3.

Figure 3 presents the estimation biases for 4 selected word
pairs. Theoretically, the estimator $\hat{R}_b$ is unbiased. Figure
3 verifies this fact as the empirical biases are all very small
and no systematic biases can be observed.

Figure 3: Biases, empirically estimated from 25000
simulations at each sample size k. "M" denotes the
original minwise hashing.



Figure 4 plots the empirical mean square errors (MSE =
variance + bias$^2$) and the theoretical variances (in dashed
lines), for all 10 word pairs. However, all dashed lines overlapped
with the corresponding solid curves. This figure satisfactorily
illustrates that the variance formula (14) is accurate
and $\hat{R}_b$ is indeed unbiased (because MSE = variance).

Figure 4: MSEs, empirically estimated from simulations.
"M" denotes the original minwise hashing.
"Theor." denotes the theoretical variances
$\mathrm{Var}(\hat{R}_b)$ and $\mathrm{Var}(\hat{R}_M)$. Those dashed curves, however,
are invisible because the empirical results overlapped
the theoretical predictions. At the same
sample size k, we always have $\mathrm{Var}(\hat{R}_1) > \mathrm{Var}(\hat{R}_2) >
\mathrm{Var}(\hat{R}_3) > \mathrm{Var}(\hat{R}_M)$. However, $\hat{R}_1$ only requires 1 bit
per sample while $\hat{R}_2$ requires 2 bits, etc.



3.2 Experiment 2 

This section presents an experiment for finding pairs whose
resemblance values > $R_0$. This experiment is in the same
spirit as [3][5]. We use all 2633 words (i.e., 3465028 pairs) as
described in Experiment 1. We use both $\hat{R}_M$ and $\hat{R}_b$ (b =
1, 2, 3, 4) and then present the precision and recall curves,
at different values of thresholds $R_0$ and sample sizes k.

Figure 5: Precision & Recall, averaged over 700 repetitions.
The task is to retrieve pairs with R > $R_0$.



Figure 5 presents the precision and recall curves. The recall
curves (right panels) do not differentiate $\hat{R}_M$ from $\hat{R}_b$ well
(unless b = 1). The precision curves (left panels) show more
clearly that using b = 1 may result in lower precision values
than $\hat{R}_M$ (especially when $R_0$ is small), at the same sample
size k. When b $\ge$ 3, $\hat{R}_b$ performs very similarly to $\hat{R}_M$.

We stored each sample for $\hat{R}_M$ using 32 bits, although in
real applications, 64 bits may be needed [3][5][10]. Table 2
summarizes the relative improvements of $\hat{R}_b$ over $\hat{R}_M$, in
terms of bits, for each threshold $R_0$. Not surprisingly, the
results are quite consistent with Figure 2.

Table 2: Relative improvement (in space) of $\hat{R}_b$ (using
b bits per sample) over $\hat{R}_M$ (32 bits per sample). For
precision = 0.7, 0.8, we find the required sample sizes
(from Figure 5) for $\hat{R}_M$ and $\hat{R}_b$ and use them to estimate
the required storage in bits. The values in the table are
the ratios of the storage bits. For b = 1 and threshold
$R_0 \ge 0.5$, the improvements are roughly 10 ~ 18-fold, consistent
with the theoretical predictions in Figure 2.

         Precision = 0.7                 Precision = 0.8
R0       b=1    b=2    b=3    b=4        b=1    b=2    b=3    b=4
0.3      6.49   6.61   7.04   6.40       -      7.96   7.58   6.58
0.4      7.86   8.65   7.48   6.50       8.63   8.36   7.71   6.91
0.5      9.52   9.23   7.98   7.24       11.1   9.85   8.22   7.39
0.6      11.5   10.3   8.35   7.10       13.9   10.2   8.07   7.28
0.7      13.4   12.2   9.24   7.20       14.5   12.3   8.91   7.39
0.8      17.6   12.5   9.96   7.76       18.2   12.8   9.90   8.08
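
The entries of Table 2 are ratios of storage bits at matched precision. As a minimal sketch of that arithmetic, assuming hypothetical required sample sizes `k_m` (for $\hat{R}_M$) and `k_b` (for $\hat{R}_b$) read off the precision curves:

```python
def space_improvement(k_m, k_b, b, bits_m=32):
    """Ratio of storage bits: (bits_m * k_m) for R_M over (b * k_b) for R_b."""
    return (bits_m * k_m) / (b * k_b)


# Hypothetical sample sizes, for illustration only (not taken from Figure 5):
# if R_M needs 100 samples and the 1-bit estimator needs 300 samples,
# the space improvement is 32 * 100 / (1 * 300).
example = space_improvement(k_m=100, k_b=300, b=1)
```

In this made-up case the 1-bit scheme needs three times as many samples but still uses roughly 10.7 times fewer bits.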



3.3 Experiment 3 

To illustrate the improvements by the use of b-bit minwise 
hashing on a real-life application, we conducted a duplicate 
detection experiment using a corpus of 10000 news docu- 
ments (49995000 pairs). The dataset was crawled as part 
of the BLEWS project at Microsoft [12]. In the news do- 
main, duplicate detection is an important problem as (e.g.) 
search engines must not serve up the same story multiple 
times and news stories (especially AP stories) are commonly 
copied with slight alterations/changes in bylines only. 

In the experiments we computed the pairwise resemblances
for all documents in the set; we present the data for retrieving
document pairs with resemblance R > $R_0$ below.

We estimate the resemblances using $\hat{R}_b$ with b = 1, 2, 4 bits,
and the original minwise hashing (using 32 bits). Figure 6
presents the precision & recall curves. The recall values are
all very high (mostly > 0.95) and do not differentiate the
various estimators well.

The precision curves for $\hat{R}_4$ (using 4 bits per sample) and
$\hat{R}_M$ (using 32 bits per sample) are almost indistinguishable,
suggesting an 8-fold improvement in space using b = 4.

When using b = 1 or 2, the space improvements are normally
around 10-fold to 15-fold, compared to $\hat{R}_M$, especially
for achieving high precisions (e.g., > 0.9). This experiment
again confirms the significant improvement of the b-bit minwise
hashing using b = 1 (or 2).



[Figure 6 panels: precision and recall versus sample size (k), with curves for b = 1, b = 2, b = 4, and M.]

Figure 6: News Precision & Recall. The task is to
retrieve news article pairs with R > $R_0$.



Note that in the context of (Web) document duplicate detection,
in addition to shingling, a number of specialized
hash-signatures have been proposed, which leverage properties
of natural-language text (such as the placement of
stopwords [25]). However, our approach is not aimed at
any specific datasets, but is a general, domain-independent
technique. Also, to the extent that other approaches rely on
minwise hashing for signature computation, these may be
combined with our techniques.

4. DISCUSSION: COMBINING BITS FOR 
ENHANCING PERFORMANCE 

Figure 1 and Figure 2 have shown that, for about R > 0.4,
using b = 1 always outperforms using b > 1, even in the least
favorable situation. This naturally leads to the conjecture
that one may be able to further improve the performance
using "b < 1", when R is close to 1.

One simple approach to implement "b < 1" is to combine
two bits from two permutations.

Recall $e_{1,1,\pi}$ denotes the lowest bit of the hashed value under
$\pi$. Theorem 1 has proved that

$$E_1 = \Pr\left(e_{1,1,\pi} = e_{2,1,\pi}\right) = C_{1,1} + (1 - C_{2,1}) R.$$

Consider two permutations $\pi_1$ and $\pi_2$. We store

$$x_1 = \mathrm{XOR}(e_{1,1,\pi_1}, e_{1,1,\pi_2}), \qquad x_2 = \mathrm{XOR}(e_{2,1,\pi_1}, e_{2,1,\pi_2}).$$

Then $x_1 = x_2$ either when $e_{1,1,\pi_1} = e_{2,1,\pi_1}$ and $e_{1,1,\pi_2} =
e_{2,1,\pi_2}$, or, when $e_{1,1,\pi_1} \ne e_{2,1,\pi_1}$ and $e_{1,1,\pi_2} \ne e_{2,1,\pi_2}$. Thus

$$T = \Pr(x_1 = x_2) = E_1^2 + (1 - E_1)^2, \qquad (19)$$

which is a quadratic equation with solution

$$R = \frac{\sqrt{2T - 1} + 1 - 2C_{1,1}}{2 - 2C_{2,1}}. \qquad (20)$$



We can estimate $T$ without bias as a binomial. However, the
resultant estimator for $R$ will be biased, at small sample size
$k$, due to the nonlinearity. We will recommend the following
estimator

$$\hat{R}_{1/2} = \frac{\sqrt{\max\{2\hat{T} - 1, 0\}} + 1 - 2C_{1,1}}{2 - 2C_{2,1}}. \qquad (21)$$

The truncation max{·, 0} will introduce further bias; but it
is necessary and is usually a good bias-variance trade-off.
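
The estimator (21) is a direct function of the empirical match rate $\hat{T}$. A minimal sketch follows; the function name and the coefficient values used in the round-trip check are hypothetical, chosen only to exercise the formula.

```python
import math


def r_half_estimate(t_hat, c11, c21):
    """Estimator (21): invert T = E1^2 + (1 - E1)^2, truncating 2T - 1 at 0."""
    root = math.sqrt(max(2.0 * t_hat - 1.0, 0.0))
    return (root + 1.0 - 2.0 * c11) / (2.0 - 2.0 * c21)


# Round-trip check with hypothetical coefficients C_{1,1}, C_{2,1} and R.
c11, c21, R = 0.48, 0.47, 0.8
e1 = c11 + (1 - c21) * R             # E_1 from Theorem 1
t = e1 ** 2 + (1 - e1) ** 2          # T from (19)
recovered = r_half_estimate(t, c11, c21)
```

Feeding the exact $T$ back in recovers $R$, and a noisy $\hat{T}$ below 1/2 is clipped rather than producing the square root of a negative number, which is the role of the truncation.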

We use $\hat{R}_{1/2}$ to indicate that two bits are combined into one.
The asymptotic variance of $\hat{R}_{1/2}$ can be derived using the
"delta method" in statistics:

$$\mathrm{Var}\left(\hat{R}_{1/2}\right) = \frac{1}{k} \frac{T(1 - T)}{4 (1 - C_{2,1})^2 (2T - 1)} + O\left(\frac{1}{k^2}\right). \qquad (22)$$



One should keep in mind that, in order to generate $k$ samples
for $\hat{R}_{1/2}$, we have to conduct $2 \times k$ permutations. Of course,
each sample is still stored using 1 bit, even though we use
"b = 1/2" to denote this estimator.



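Interestingly, as $R \to 1$, $\hat{R}_{1/2}$ does twice as well as $\hat{R}_1$:

$$\lim_{R \to 1} \frac{\mathrm{Var}\left(\hat{R}_1\right)}{\mathrm{Var}\left(\hat{R}_{1/2}\right)} = \lim_{R \to 1} \frac{2 (1 - 2E_1)^2}{(1 - E_1)^2 + E_1^2} = 2. \qquad (23)$$

Recall, if $R = 1$, then $r_1 = r_2$, $C_{1,1} = C_{2,1}$, and $E_1 =
C_{1,1} + 1 - C_{2,1} = 1$.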

On the other hand, $\hat{R}_{1/2}$ may not be an ideal estimator when
$R$ is not too large. For example, one can numerically show
that (as $k \to \infty$)

$$\mathrm{Var}\left(\hat{R}_1\right) < \mathrm{Var}\left(\hat{R}_{1/2}\right), \quad \text{if } R < 0.5774,\ r_1, r_2 \to 0.$$
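
The threshold 0.5774 can be reproduced numerically. The sketch below compares the leading-order variances of the two estimators in the limit $r_1, r_2 \to 0$, where $C_{1,1} = C_{2,1} = 1/2$, and locates the crossover by bisection. The function names and the bisection bracket are choices made for this illustration, and the common factor $1/k$ is dropped.

```python
import math


def var_r1(R, c1, c2):
    """k * Var(R_1) from the b-bit variance formula with b = 1."""
    e1 = c1 + (1 - c2) * R
    return e1 * (1 - e1) / (1 - c2) ** 2


def var_r_half(R, c1, c2):
    """k * Var(R_{1/2}), leading term of the delta-method variance."""
    e1 = c1 + (1 - c2) * R
    t = e1 ** 2 + (1 - e1) ** 2
    return t * (1 - t) / (4 * (1 - c2) ** 2 * (2 * t - 1))


def crossover(c1=0.5, c2=0.5, lo=0.1, hi=0.99):
    """Bisection for the R at which the two asymptotic variances coincide."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if var_r1(mid, c1, c2) < var_r_half(mid, c1, c2):
            lo = mid      # R_1 still better: crossover lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


threshold = crossover()
ratio_near_one = var_r1(0.999999, 0.5, 0.5) / var_r_half(0.999999, 0.5, 0.5)
```

In this limit the two variances reduce to $1 - R^2$ and $(1 + R^2)(1 - R^2)/(4R^2)$, so the crossover is $R = 1/\sqrt{3} \approx 0.5774$, and the variance ratio approaches 2 as $R \to 1$.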







Figure 7 plots the empirical MSEs for four word pairs in
Experiment 1, for $\hat{R}_{1/2}$, $\hat{R}_1$, and $\hat{R}_M$.

Figure 7: MSEs for comparing $\hat{R}_{1/2}$ with $\hat{R}_1$ and $\hat{R}_M$.
Due to the bias, the theoretical variances $\mathrm{Var}(\hat{R}_{1/2})$,
i.e., (22), deviate from the empirical MSEs when the
sample size k is not large.

• For the highly similar pair, "KONG-HONG," $\hat{R}_{1/2}$ exhibits
superb performance compared to $\hat{R}_1$.

• For the fairly similar pair, "OF-AND," $\hat{R}_{1/2}$ is still considerably
better.

• For "UNITED-STATES," whose R = 0.591, $\hat{R}_{1/2}$ performs
similarly to $\hat{R}_1$.

• For "LOW-PAY," whose R = 0.112 only, the theoretical
variance of $\hat{R}_{1/2}$ is very large. However, owing to
the variance-bias trade-off, the empirical performance
of $\hat{R}_{1/2}$ is not too bad.

In summary, while the idea of combining two bits is interesting,
it is mostly useful in applications which only care
about pairs of very high similarities.

5. CONCLUSION 

The minwise hashing technique has been widely used as a
standard approach in information retrieval, for efficiently
computing set similarity in massive data sets (e.g., duplicate
detection). Prior studies commonly used 64 or 40 bits to
store each hashed value.

In this study, we propose the theoretical framework of b-bit
minwise hashing, by only storing the lowest b bits of
each hashed value. We theoretically prove that, when the
similarity is reasonably high (e.g., resemblance > 0.5), using
b = 1 bit per hashed value can, even in the worst case, gain a
space improvement by 10.7 ~ 16-fold, compared to storing
each hashed value using 32 bits. The improvement would
be even more significant (e.g., at least 21.3 ~ 32-fold) if the
original hashed values are stored using 64 bits.

We also discussed the idea of combining 2 bits from different 
hashed values, to further enhance the improvement, when 
the target similarity is very high. 

Our proposed method is simple and requires only minimal 
modification to the original minwise hashing algorithm. We 
expect our method will be adopted in practice.

APPENDIX

A. PROOF OF THEOREM 1

Consider two sets, $S_1$ and $S_2$,

$$S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\},$$

$$f_1 = |S_1|, \quad f_2 = |S_2|, \quad a = |S_1 \cap S_2|.$$

We apply a random permutation $\pi$ on $S_1$ and $S_2$:
$\pi : \Omega \longrightarrow \Omega$.

Define the minimum values under $\pi$ to be $z_1$ and $z_2$:

$$z_1 = \min(\pi(S_1)), \qquad z_2 = \min(\pi(S_2)).$$

Define $e_{1,i}$ = ith lowest bit of $z_1$, and $e_{2,i}$ = ith lowest bit
of $z_2$. The task is to derive the analytical expression for

$$\Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1\right),$$

which can be decomposed to be

$$\Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1,\ z_1 = z_2\right) + \Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1,\ z_1 \ne z_2\right)$$

$$= \Pr(z_1 = z_2) + \Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1,\ z_1 \ne z_2\right)$$

$$= R + \Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1,\ z_1 \ne z_2\right),$$

where $R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \Pr(z_1 = z_2)$ is the resemblance.

When b = 1, the task boils down to estimating

$$\Pr(e_{1,1} = e_{2,1},\ z_1 \ne z_2)
= \sum_{i = 0, 2, 4, \ldots} \left\{ \sum_{j \ne i,\ j = 0, 2, 4, \ldots} \Pr(z_1 = i, z_2 = j) \right\}
+ \sum_{i = 1, 3, 5, \ldots} \left\{ \sum_{j \ne i,\ j = 1, 3, 5, \ldots} \Pr(z_1 = i, z_2 = j) \right\}.$$

Therefore, we need the following basic probability formula:

$$\Pr(z_1 = i,\ z_2 = j,\ i \ne j).$$

We will first start with

$$\Pr(z_1 = i,\ z_2 = j,\ i < j) = \frac{P_1 + P_2}{P_3},$$

where

$$P_3 = \binom{D}{a} \binom{D - a}{f_1 - a} \binom{D - f_1}{f_2 - a},$$

$$P_1 = \binom{D - j - 1}{a} \binom{D - j - 1 - a}{f_2 - a - 1} \binom{D - i - 1 - f_2}{f_1 - a - 1},$$

$$P_2 = \binom{D - j - 1}{a - 1} \binom{D - j - a}{f_2 - a} \binom{D - i - 1 - f_2}{f_1 - a - 1}.$$

The expressions for $P_1$, $P_2$, and $P_3$ can be understood by
the experiment of randomly throwing $f_1 + f_2 - a$ balls into
$D$ locations, labeled 0, 1, 2, ..., $D - 1$. Those $f_1 + f_2 - a$ balls
belong to three disjoint sets: $S_1 - S_1 \cap S_2$, $S_2 - S_1 \cap S_2$,
and $S_1 \cap S_2$. Without any restriction, the total number of
combinations should be $P_3$.

To understand $P_1$ and $P_2$, we need to consider two cases:

1. The jth element is not in $S_1 \cap S_2$ $\Longrightarrow$ $P_1$.

2. The jth element is in $S_1 \cap S_2$ $\Longrightarrow$ $P_2$.



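The next task is to simplify the expression for the probability
$\Pr(z_1 = i,\ z_2 = j,\ i < j)$. After conducting expansions and
cancellations, we obtain

$$\Pr(z_1 = i,\ z_2 = j,\ i < j) = \frac{P_1 + P_2}{P_3}$$

$$= \frac{\left(\frac{1}{a} + \frac{1}{f_2 - a}\right) \frac{(D - j - 1)! \, (D - i - 1 - f_2)!}{(a - 1)! \, (f_1 - a - 1)! \, (f_2 - a - 1)! \, (D - j - f_2)! \, (D - i - f_1 - f_2 + a)!}}{\frac{D!}{a! \, (f_1 - a)! \, (f_2 - a)! \, (D - f_1 - f_2 + a)!}}$$

$$= \frac{f_2 (f_1 - a) \, (D - j - 1)! \, (D - f_2 - i - 1)! \, (D - f_1 - f_2 + a)!}{D! \, (D - f_2 - j)! \, (D - f_1 - f_2 + a - i)!}$$

$$= \frac{f_2 (f_1 - a) \prod_{t=0}^{j-i-2} (D - f_2 - i - 1 - t) \prod_{t=0}^{i-1} (D - f_1 - f_2 + a - t)}{\prod_{t=0}^{j} (D - t)}$$

$$= \frac{f_2}{D} \, \frac{f_1 - a}{D - 1} \prod_{t=0}^{j-i-2} \frac{D - f_2 - i - 1 - t}{D - 2 - t} \prod_{t=0}^{i-1} \frac{D - f_1 - f_2 + a - t}{D + i - j - 1 - t}.$$

For convenience, we introduce the following notation:

$$r_1 = \frac{f_1}{D}, \qquad r_2 = \frac{f_2}{D}, \qquad s = \frac{a}{D}.$$

Also, we assume $D$ is large (which is always satisfied in practice).
Thus, we can obtain a reasonable approximation:

$$\Pr(z_1 = i,\ z_2 = j,\ i < j) = r_2 (r_1 - s) [1 - r_2]^{j - i - 1} [1 - (r_1 + r_2 - s)]^{i}.$$

Similarly, we obtain, for large $D$,

$$\Pr(z_1 = i,\ z_2 = j,\ i > j) = r_1 (r_2 - s) [1 - r_1]^{i - j - 1} [1 - (r_1 + r_2 - s)]^{j}.$$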


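Now we have the tool to calculate the probability

$$\Pr(e_{1,1} = e_{2,1},\ z_1 \ne z_2)
= \sum_{i = 0, 2, 4, \ldots} \left\{ \sum_{j \ne i,\ j = 0, 2, 4, \ldots} \Pr(z_1 = i, z_2 = j) \right\}
+ \sum_{i = 1, 3, 5, \ldots} \left\{ \sum_{j \ne i,\ j = 1, 3, 5, \ldots} \Pr(z_1 = i, z_2 = j) \right\}.$$

For example, (again, assuming $D$ is large)

$$\Pr(z_1 = 0,\ z_2 = 2, 4, 6, \ldots) = r_2 (r_1 - s) \left([1 - r_2] + [1 - r_2]^3 + [1 - r_2]^5 + \ldots\right)
= r_2 (r_1 - s) \frac{1 - r_2}{1 - [1 - r_2]^2},$$

$$\Pr(z_1 = 1,\ z_2 = 3, 5, 7, \ldots) = r_2 (r_1 - s) [1 - (r_1 + r_2 - s)] \left([1 - r_2] + [1 - r_2]^3 + [1 - r_2]^5 + \ldots\right)
= r_2 (r_1 - s) [1 - (r_1 + r_2 - s)] \frac{1 - r_2}{1 - [1 - r_2]^2}.$$

Therefore,

$$\sum_{i = 0, 2, 4, \ldots} \left\{ \sum_{i < j,\ j = 0, 2, 4, \ldots} \Pr(z_1 = i, z_2 = j) \right\}
+ \sum_{i = 1, 3, 5, \ldots} \left\{ \sum_{i < j,\ j = 1, 3, 5, \ldots} \Pr(z_1 = i, z_2 = j) \right\}$$

$$= r_2 (r_1 - s) \frac{1 - r_2}{1 - [1 - r_2]^2} \times \left(1 + [1 - (r_1 + r_2 - s)] + [1 - (r_1 + r_2 - s)]^2 + \ldots\right)$$

$$= r_2 (r_1 - s) \frac{1 - r_2}{1 - [1 - r_2]^2} \, \frac{1}{r_1 + r_2 - s}.$$

By symmetry, we know

$$\sum_{j = 0, 2, 4, \ldots} \left\{ \sum_{i > j,\ i = 0, 2, 4, \ldots} \Pr(z_1 = i, z_2 = j) \right\}
+ \sum_{j = 1, 3, 5, \ldots} \left\{ \sum_{i > j,\ i = 1, 3, 5, \ldots} \Pr(z_1 = i, z_2 = j) \right\}
= r_1 (r_2 - s) \frac{1 - r_1}{1 - [1 - r_1]^2} \, \frac{1}{r_1 + r_2 - s}.$$

Combining the probabilities, we obtain

$$\Pr(e_{1,1} = e_{2,1},\ z_1 \ne z_2)
= \frac{r_2 (1 - r_2)}{1 - [1 - r_2]^2} \, \frac{r_1 - s}{r_1 + r_2 - s} + \frac{r_1 (1 - r_1)}{1 - [1 - r_1]^2} \, \frac{r_2 - s}{r_1 + r_2 - s}$$

$$= A_{1,1} \frac{r_2 - s}{r_1 + r_2 - s} + A_{2,1} \frac{r_1 - s}{r_1 + r_2 - s},$$

where

$$A_{1,b} = \frac{r_1 [1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}, \qquad A_{2,b} = \frac{r_2 [1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}.$$

Therefore, we can obtain the desired probability, for b = 1,

$$\Pr\left(\prod_{i=1}^{b=1} 1\{e_{1,i} = e_{2,i}\} = 1\right)
= R + A_{1,1} \frac{r_2 - s}{r_1 + r_2 - s} + A_{2,1} \frac{r_1 - s}{r_1 + r_2 - s}$$

$$= R + A_{1,1} \frac{f_2 - a}{f_1 + f_2 - a} + A_{2,1} \frac{f_1 - a}{f_1 + f_2 - a}$$

$$= R + A_{1,1} \frac{f_2 - \frac{R}{1+R}(f_1 + f_2)}{f_1 + f_2 - \frac{R}{1+R}(f_1 + f_2)} + A_{2,1} \frac{f_1 - \frac{R}{1+R}(f_1 + f_2)}{f_1 + f_2 - \frac{R}{1+R}(f_1 + f_2)}$$

$$= R + A_{1,1} \frac{f_2 - R f_1}{f_1 + f_2} + A_{2,1} \frac{f_1 - R f_2}{f_1 + f_2}$$

$$= C_{1,1} + (1 - C_{2,1}) R,$$

where

$$C_{1,b} = A_{1,b} \frac{r_2}{r_1 + r_2} + A_{2,b} \frac{r_1}{r_1 + r_2}, \qquad
C_{2,b} = A_{1,b} \frac{r_1}{r_1 + r_2} + A_{2,b} \frac{r_2}{r_1 + r_2}.$$

To this end, we have proved the main result for b = 1.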
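
The geometric-series step in the b = 1 case can be checked numerically. The sketch below, with hypothetical function names and arbitrary illustrative values of $r_1$, $r_2$, $s$, sums the large-$D$ approximations of $\Pr(z_1 = i, z_2 = j)$ over all pairs $i \ne j$ of equal parity (truncated at a finite limit) and compares the result with the closed form in terms of $A_{1,1}$ and $A_{2,1}$.

```python
def pr_lt(i, j, r1, r2, s):
    """Large-D approximation of Pr(z1 = i, z2 = j) for i < j."""
    return r2 * (r1 - s) * (1 - r2) ** (j - i - 1) * (1 - (r1 + r2 - s)) ** i


def pr_gt(i, j, r1, r2, s):
    """Large-D approximation of Pr(z1 = i, z2 = j) for i > j."""
    return r1 * (r2 - s) * (1 - r1) ** (i - j - 1) * (1 - (r1 + r2 - s)) ** j


def same_parity_sum(r1, r2, s, limit=400):
    """Truncated sum over i != j with equal lowest bit (both even or both odd)."""
    total = 0.0
    for i in range(limit):
        for j in range(i % 2, limit, 2):
            if i < j:
                total += pr_lt(i, j, r1, r2, s)
            elif i > j:
                total += pr_gt(i, j, r1, r2, s)
    return total


def closed_form(r1, r2, s):
    """A_{1,1} (r2 - s)/(r1 + r2 - s) + A_{2,1} (r1 - s)/(r1 + r2 - s)."""
    a1 = r1 * (1 - r1) / (1 - (1 - r1) ** 2)
    a2 = r2 * (1 - r2) / (1 - (1 - r2) ** 2)
    u = r1 + r2 - s
    return a1 * (r2 - s) / u + a2 * (r1 - s) / u


# Illustrative (hypothetical) ratios with s <= min(r1, r2).
numeric = same_parity_sum(0.3, 0.2, 0.1)
exact = closed_form(0.3, 0.2, 0.1)
```

The truncation limit only needs to be large enough for the geometric tails to be negligible; with these values both quantities agree to many digits (about 0.3252).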

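The proof for the general case, i.e., b = 2, 3, ..., follows a
similar procedure:

$$\Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1\right)
= R + A_{1,b} \frac{r_2 - s}{r_1 + r_2 - s} + A_{2,b} \frac{r_1 - s}{r_1 + r_2 - s}
= C_{1,b} + (1 - C_{2,b}) R.$$

The final task is to show some useful properties of $A_{1,b}$ (same
for $A_{2,b}$). The first derivative of $A_{1,b}$ with respect to $b$ is

$$\frac{\partial A_{1,b}}{\partial b}
= \frac{r_1 [1 - r_1]^{2^b - 1} \, 2^b \log 2 \, \log(1 - r_1) \left(1 - [1 - r_1]^{2^b}\right) + [1 - r_1]^{2^b} \, 2^b \log 2 \, \log(1 - r_1) \, r_1 [1 - r_1]^{2^b - 1}}{\left(1 - [1 - r_1]^{2^b}\right)^2}$$

$$= \frac{r_1 [1 - r_1]^{2^b - 1} \, 2^b \log 2 \, \log(1 - r_1)}{\left(1 - [1 - r_1]^{2^b}\right)^2} \le 0 \qquad (\text{Note that } \log(1 - r_1) \le 0).$$

Thus, $A_{1,b}$ is a monotonically decreasing function of $b$.

Also,

$$\lim_{r_1 \to 0} A_{1,b} = \lim_{r_1 \to 0} \frac{[1 - r_1]^{2^b - 1} - r_1 (2^b - 1) [1 - r_1]^{2^b - 2}}{2^b [1 - r_1]^{2^b - 1}} = \frac{1}{2^b},$$

and

$$\frac{\partial A_{1,b}}{\partial r_1}
= \frac{[1 - r_1]^{2^b - 1} - r_1 (2^b - 1) [1 - r_1]^{2^b - 2}}{1 - [1 - r_1]^{2^b}}
- \frac{2^b [1 - r_1]^{2^b - 1} \, r_1 [1 - r_1]^{2^b - 1}}{\left(1 - [1 - r_1]^{2^b}\right)^2}$$

$$= \frac{[1 - r_1]^{2^b - 2}}{\left(1 - [1 - r_1]^{2^b}\right)^2} \left(1 - 2^b r_1 - [1 - r_1]^{2^b}\right) \le 0.$$

Note that $(1 - x)^c \ge 1 - cx$, for $c \ge 1$ and $x \le 1$.

Therefore $A_{1,b}$ is a monotonically decreasing function of $r_1$.
We complete the whole proof.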



